Public Transportation Accessibility in Twin Cities

Author

Tina Chen, Shirley Jiang, Cynthia Zhang

Published

April 16, 2024

source('cleaning.R')

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ readr     2.1.4
✔ ggplot2   3.4.4     ✔ stringr   1.5.0
✔ lubridate 1.9.2     ✔ tibble    3.2.1
✔ purrr     1.0.2     ✔ tidyr     1.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE


Attaching package: 'Hmisc'


The following objects are masked from 'package:dplyr':

    src, summarize


The following objects are masked from 'package:base':

    format.pval, units



Attaching package: 'janitor'


The following objects are masked from 'package:stats':

    chisq.test, fisher.test
Reading layer `school_program_locations' from data source 
  `/Users/apple/Documents/GitHub/212/STAT212_final_project/data/shp_struc_school_program_locs' 
  using driver `ESRI Shapefile'
Simple feature collection with 5872 features and 24 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 190534 ymin: 4817361 xmax: 747765 ymax: 5468774
Projected CRS: NAD83 / UTM zone 15N
Reading layer `PostsecondaryEnrollment' from data source 
  `/Users/apple/Documents/GitHub/212/STAT212_final_project/data/shp_society_post_second_enroll' 
  using driver `ESRI Shapefile'
Simple feature collection with 212 features and 73 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -97.01679 ymin: 42.49901 xmax: -87.69674 ymax: 48.59
Geodetic CRS:  NAD83
Getting data from the 2018-2022 5-year ACS
Warning: • You have not set a Census API key. Users without a key are limited to 500
queries per day and may experience performance limitations.
ℹ For best results, get a Census API key at
http://api.census.gov/data/key_signup.html and then supply the key to the
`census_api_key()` function to use it throughout your tidycensus session.
This warning is displayed once per session.
Downloading feature geometry from the Census website.  To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.

  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |=                                                                     |   2%
  |                                                                            
  |==                                                                    |   3%
  |                                                                            
  |======                                                                |   8%
  |                                                                            
  |========                                                              |  11%
  |                                                                            
  |=========                                                             |  13%
  |                                                                            
  |===============                                                       |  21%
  |                                                                            
  |=================                                                     |  24%
  |                                                                            
  |=======================                                               |  32%
  |                                                                            
  |========================                                              |  34%
  |                                                                            
  |==========================                                            |  37%
  |                                                                            
  |===========================                                           |  39%
  |                                                                            
  |=============================                                         |  42%
  |                                                                            
  |==================================                                    |  49%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |==================================================                    |  71%
  |                                                                            
  |===================================================                   |  73%
  |                                                                            
  |======================================================                |  78%
  |                                                                            
  |===============================================================       |  91%
  |                                                                            
  |=================================================================     |  92%
  |                                                                            
  |======================================================================| 100%
`summarise()` has grouped output by 'tract', 'population', 'medianIncome',
'White Alone', 'Black or African American Alone', 'Asian Alone', 'American
Indian and Alaska Native Alone'. You can override using the `.groups` argument.
`summarise()` has grouped output by 'tract'. You can override using the
`.groups` argument.
`summarise()` has grouped output by 'nYear'. You can override using the
`.groups` argument.

Mapping sthe Education Sources in Twin Cities

schools_stops %>% 
  ggplot(aes(x = number_school, y = number_of_stops))+
  stat_summary_bin(fun = "mean", bins = 5, geom = "point")+ #divides the number of schools by the number of bins, so Bin 1: 0-10 schools, Bin 2: 10:20, etc. 
  geom_smooth(method = "lm")+
  labs(x = "Mean Number of Schools (Pre and Post-Secondary) in Tract", 
       y = "Mean Number of Bus Stops in Tract")+ # avg # of stops in areas with similar # of schools
  theme_classic()
`geom_smooth()` using formula = 'y ~ x'

schools_stops %>% 
  ggplot(aes(x = number_school, y = number_of_stops))+
  geom_point(alpha = 0.5)+
  geom_smooth(method = "lm")+
  labs(x = "Number of Schools (Pre and Post-Secondary) in Tract", y = "Number of Bus Stops in Tract")+ 
  theme_classic()
`geom_smooth()` using formula = 'y ~ x'

function

map

# need help:  don't know how to add title with var?

census_var_plot <- function(var, title){
  # using chosen variable to make the map with bus stops
  ggplot()+
  geom_sf(data = census2023, aes(fill = {{var}}))+
  geom_sf(data = stops_sf_county, color = "green", size = 0.1)+
  scale_fill_gradient(low = "orange", high = "blue")+
    labs(fill="", title = str_to_title(title))
}

census_var_perc_plot <- function(var, title){
  # using the percentage of each race  to make the map with bus stops
  ggplot()+
  geom_sf(data = census2023, aes(fill = {{var}}/population))+
  geom_sf(data = stops_sf_county, color = "green", size = 0.1)+
  scale_fill_gradient(low = "orange", high = "blue")+
   labs(fill="", title = str_c("Percentage of ",title))
}

for(i in 1:2){
  print(census_var_plot(get(name[i]), name[i]))
}

for(i in 3:9){
  print(census_var_perc_plot(get(name[i]), name[i]))
}

scatter plot

census_point_plot <- function(var, title){
 # Function for plotting the linear relationship between chosen variables and the number of bus stops
  stops_census_join_summ %>% 
  ggplot(aes(x = {{var}}, y = number_of_stops))+
    geom_point()+
    geom_smooth(method = "lm")+
    labs(y = "Number of Bus Stops in Tract",title = str_to_title(title))+
    theme_classic()
}

census_point_density_plot <- function(var, title){
   # Function for plotting the linear relationship between
  # percentage of each ethnicity group and the number of bus stops
  stops_census_join_summ %>% 
  ggplot(aes(x = {{var}}/population, y = number_of_stops))+
    geom_point()+
    geom_smooth(method = "lm")+
    labs(y = "Number of Bus Stops in Tract", title = str_c("Percentage of ",title))+
    theme_classic()
}



for(i in 1:2){
  print(census_point_plot(get(name[i]), name[i]))
}
`geom_smooth()` using formula = 'y ~ x'

`geom_smooth()` using formula = 'y ~ x'

for(i in 3:7){
  print(census_point_density_plot(get(name[i]), name[i]))
}
`geom_smooth()` using formula = 'y ~ x'

`geom_smooth()` using formula = 'y ~ x'

`geom_smooth()` using formula = 'y ~ x'

`geom_smooth()` using formula = 'y ~ x'

`geom_smooth()` using formula = 'y ~ x'

Overview and Covid comparison

# total ridership over 5 years
used_ridership_all %>% 
  group_by(nYear) %>%
  mutate(Total_Riders = sum(Total_Riders,na.rm = TRUE)) %>% 
  ggplot(aes(x=nYear, y=Total_Riders))+
  geom_point()+
  geom_line()+
  labs(x = "Year", y = "Total Riders")

# 2 comparisons included: weekday vs weekend, and among each year
weekends_weekdays %>% 
  ggplot(aes(x= nYear, y = Total_Riders)) +
  geom_point()+
  geom_line()+
  facet_wrap(~Schedule)+
  labs(x = "Year", y = "Total Riders")

used_ridership_all %>% filter(!(rte_class %in% c("CommRai","SuburbL","Support"))) %>% 
  group_by(nYear, rte_class) %>%
  mutate(Total_Riders = sum(Total_Riders,na.rm = TRUE)) %>% 
  ggplot(aes(x=nYear, y = Total_Riders)) +
  geom_point()+
  geom_line()+
  facet_wrap(~rte_class)+
  labs(x = "Year", y = "Total Riders")

# total ridership for BRT increase because increasing lines
used_ridership_all%>% 
  filter(rte_class == "BRT") %>% 
  distinct(nYear,Route) %>% 
  group_by(nYear) %>% 
  summarise(num_BRT = n()) %>% 
  ggplot(aes(x =nYear, y = num_BRT))+
  geom_point()+
  geom_line() # the increase of BRT can replace one of the orginal line

# daily load drops
used_ridership_all %>% 
  group_by(nYear) %>%
  mutate(Daily_load = n()) %>% 
  ggplot(aes(x=nYear, y= Daily_load))+
  geom_line()+
  geom_point()

# supply # of roads drops
used_ridership_all %>% 
  distinct(nYear,Route) %>% 
  group_by(nYear) %>% 
  mutate(nRoute = n()) %>% 
  ggplot(aes(x=nYear, y= nRoute))+
  geom_line()+
  geom_point()

cleaned_stops_census <- clean_names(stops_census_join_summ)

# outlier
cleaned_stops_census %>%
 select(population,black_or_african_american_alone,number_of_stops) %>% 
  ggplot(aes(x = black_or_african_american_alone, y= number_of_stops))+
  geom_point()
Adding missing grouping variables: `tract`, `median_income`, `white_alone`,
`asian_alone`, `american_indian_and_alaska_native_alone`

model <- lm(number_of_stops ~ population+median_income + white_alone + black_or_african_american_alone+asian_alone + american_indian_and_alaska_native_alone + native_hawaiian_and_other_pacific_islander_alone, cleaned_stops_census)

par(mfrow = c(2, 2))
plot(model)

cooksD <- cooks.distance(model)
influential <- cooksD[(cooksD > (3 * mean(cooksD, na.rm = TRUE)))]
as.numeric(names(influential)) 
 [1]  28  29  32  95  98 110 131 180 238 247 248 270 273 287 289 292 293 314 330
[20] 355 445
outlier <- rep(0, cleaned_stops_census%>% nrow() )
outlier[as.numeric(names(influential)) ] <- 1
outlier_info <- cleaned_stops_census %>% cbind(outlier = outlier) %>% filter(outlier == 1)

outlier_info_sf <- st_as_sf(outlier_info)


# We see that 22 tract have a Cook’s Distance greater than 3x the mean.

# how many sd from the linear regression because of unit difference
 ggplot()+
  geom_sf(data = census2023)+
   geom_sf(data = outlier_info_sf, fill = "red")+
  geom_sf(data = stops_sf_county, color = "green", size = 0.1, alpha = 0.3)

   # geom_text(data = outlier_info_sf, aes(label = tract, x = outlier_info_sf[["geometry"]][["X"]], y = outlier_info_sf[["geometry"]][["Y"]]), color = "black", size = 3)

Correlation Graph of All Variables:

# correlation <- function(df, var) {
#   cor_value <- cor(df[[var]], df$number_of_stops, use = "complete.obs")
#   return(cor_value)
# }

correlation_values <- vector("list", length = 7)
columns_to_iterate <- c("population", "median_income", "white_alone", "black_or_african_american_alone", "asian_alone", "american_indian_and_alaska_native_alone", "native_hawaiian_and_other_pacific_islander_alone")



for (col_name in columns_to_iterate) {
# calculate correlation values
  cor_value <- cor(cleaned_stops_census[[col_name]], cleaned_stops_census$number_of_stops, use = "complete.obs")
# confidence intervals
  mod_formula_str <- str_c("number_of_stops", "~", col_name)
  mod_form <- as.formula(mod_formula_str)
  mod <- lm(mod_form, data = cleaned_stops_census)
  ci <- confint(mod, level = 0.95)
# table of data
  correlation_values[[col_name]] <- tibble(
    variable = col_name,
    correlation = cor_value,
    ci_lower = ci[variable, "2.5 %"],
    ci_upper = ci[variable, "97.5 %"],
    slope = mod$coefficients[variable]
  )
}

correlation_dataframe <- bind_rows(correlation_values)
head(correlation_dataframe)
# A tibble: 6 × 5
  variable                                correlation ci_lower ci_upper    slope
  <chr>                                         <dbl>    <dbl>    <dbl>    <dbl>
1 population                                   0.276   3.31e-3  6.36e-3  4.83e-3
2 median_income                               -0.119  -1.29e-4 -1.77e-5 -7.36e-5
3 white_alone                                  0.182   1.82e-3  5.32e-3  3.57e-3
4 black_or_african_american_alone              0.131   1.66e-3  8.95e-3  5.31e-3
5 asian_alone                                  0.0651 -1.24e-3  7.61e-3  3.19e-3
6 american_indian_and_alaska_native_alone      0.0483 -1.80e-2  5.92e-2  2.06e-2
correlation_dataframe %>% 
  ggplot(aes(x = variable, y = slope))+
  geom_errorbar(aes(ymin = ci_lower, ymax = ci_upper), width = 0.2)+
    geom_point(size = 2)+
  labs(x = "Strength of Different Variables in Relation to Number of Stops")+
  theme_minimal()

Identifying the Outliers:

# demographics as per capita

outliers_demographic_pc <- cleaned_stops_census %>% 
 mutate(white_pc = white_alone/population, 
        black_pc = black_or_african_american_alone/population,
        asian_pc = asian_alone/population,
        american_indian_and_alaska_native_pc = american_indian_and_alaska_native_alone/population,
        native_hawaiian_other_pi_pc = native_hawaiian_and_other_pacific_islander_alone/population)

# The stop outliers that have more than 100 total number of bus stops are:

outliers_demographic_pc %>% 
  filter(number_of_stops > 100) %>% 
  arrange(desc(number_of_stops))
# A tibble: 6 × 15
# Groups:   tract, population, median_income, white_alone,
#   black_or_african_american_alone, asian_alone,
#   american_indian_and_alaska_native_alone [6]
  tract  population median_income white_alone black_or_african_ame…¹ asian_alone
  <chr>       <dbl>         <dbl>       <dbl>                  <dbl>       <dbl>
1 1261.…       6973        120571        4518                   1367         468
2 265.1…       5182         97426        3617                    363         703
3 1044;        2752         68272        1535                    643         252
4 251;         3428         59844        1974                    427         393
5 342.0…       3638         74821        2475                    212         675
6 216.0…       5768         73740        4536                    230         197
# ℹ abbreviated name: ¹​black_or_african_american_alone
# ℹ 9 more variables: american_indian_and_alaska_native_alone <dbl>,
#   native_hawaiian_and_other_pacific_islander_alone <dbl>,
#   number_of_stops <int>, geometry <POLYGON [°]>, white_pc <dbl>,
#   black_pc <dbl>, asian_pc <dbl>, american_indian_and_alaska_native_pc <dbl>,
#   native_hawaiian_other_pi_pc <dbl>

Tract 1261.02 is located near the U of M campus in Minneapolis which supports the hypothesis that there would be more stops (greater access to public transportation) for individuals who likely use it more - students. It also has the largest population size out of our narrowed-down list. This would also likely be a viable area for professionals to reside away from downtown Minneapolis but still in commuting distance to their jobs.

Tract 1261.02

Tract 265.14 is located in Plymouth, MN a large suburb near the Twin Cities. That explains their relatively high median income levels. This tract is also mainly where the bus stops are located to get around the suburb, explaining the high level of stops.

Tract 265.14

Tract 1044 is located in central Minneapolis which also makes intuitive sense why it is in the top three for number of stops. There is often more accessibility located in the downtown areas of a city to provide ongoers an easy way to move around. There is likely less population because downtowns tend to be where businesses are located, so many workers probably commute to work from other locations.

Tract 1044

Tract 251 is one the largest tract sizes which might explain partly why it has so many bus stops. However, another reason might be because Mall of America is located within this tract which prompts there to be more public transportation around this area to get to the Mall as it is a major destination within the Twin Cities.

Tract 251 > Tract 342.01 is downtown St. Paul. Similar to the city cases located above, downtown St. Paul would have a characteristically higher number of bus stops to allow citygoers a more convenient way to get downtown or move around downtown. Once again, less people tend to live downtown but rather in the areas around.

Tract 342.01

Tract 216.02 is in Golden Valley, MN. Golden Valley is a first-ring suburb of Minneapolis meaning that there is still public transportation access to provide the residents to commute into the city as they so please. They would need more stops in order to fully supply the residents their trips into the city.

Tract 216.02
sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: x86_64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.6.2

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Chicago
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] janitor_2.2.0   Hmisc_5.1-1     sf_1.0-15       tidycensus_1.6 
 [5] lubridate_1.9.2 forcats_1.0.0   stringr_1.5.0   purrr_1.0.2    
 [9] readr_2.1.4     tidyr_1.3.0     tibble_3.2.1    ggplot2_3.4.4  
[13] tidyverse_2.0.0 dplyr_1.1.3    

loaded via a namespace (and not attached):
 [1] gtable_0.3.4       xfun_0.43          htmlwidgets_1.6.2  tigris_2.0.4      
 [5] lattice_0.21-9     tzdb_0.4.0         vctrs_0.6.5        tools_4.3.2       
 [9] generics_0.1.3     curl_5.2.0         proxy_0.4-27       fansi_1.0.6       
[13] cluster_2.1.4      pkgconfig_2.0.3    Matrix_1.6-1.1     KernSmooth_2.23-22
[17] data.table_1.15.2  checkmate_2.3.1    uuid_1.1-1         lifecycle_1.0.4   
[21] farver_2.1.1       compiler_4.3.2     munsell_0.5.0      snakecase_0.11.1  
[25] htmltools_0.5.6    class_7.3-22       yaml_2.3.7         htmlTable_2.4.2   
[29] Formula_1.2-5      pillar_1.9.0       crayon_1.5.2       classInt_0.4-10   
[33] wk_0.9.0           rpart_4.1.21       nlme_3.1-163       tidyselect_1.2.0  
[37] rvest_1.0.3        digest_0.6.33      stringi_1.7.12     labeling_0.4.3    
[41] splines_4.3.2      fastmap_1.1.1      grid_4.3.2         colorspace_2.1-0  
[45] cli_3.6.2          magrittr_2.0.3     base64enc_0.1-3    utf8_1.2.4        
[49] e1071_1.7-14       foreign_0.8-85     withr_3.0.0        backports_1.4.1   
[53] scales_1.3.0       rappdirs_0.3.3     timechange_0.2.0   rmarkdown_2.24    
[57] httr_1.4.7         nnet_7.3-19        gridExtra_2.3      hms_1.1.3         
[61] evaluate_0.21      knitr_1.43         mgcv_1.9-0         s2_1.1.4          
[65] rlang_1.1.3        Rcpp_1.0.11.6      glue_1.7.0         DBI_1.1.3         
[69] xml2_1.3.5         rstudioapi_0.15.0  jsonlite_1.8.7     R6_2.5.1          
[73] units_0.8-4